
cp: chore(fix): Deployment parallelism (2189) into r0.3.0#2406

Closed
ko3n1g wants to merge 1 commit into r0.3.0 from cherry-pick-2189-r0.3.0

Conversation


@ko3n1g ko3n1g commented Feb 17, 2026

beep boop [🤖]: Hi @ko3n1g 👋,

we've cherry-picked #2189 into  for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • Changes

    • Simplified deployment configuration by removing parallelism parameters.
    • Adjusted experiment name length constraints for specific cluster configurations.
  • Chores

    • Reformatted evaluation pipeline command for improved readability.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

ko3n1g commented Feb 17, 2026

/ok to test 02b5f5d


copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

Three configuration-related scripts are modified: parallelism arguments removed from evaluation deployment commands, the command string reformatted for readability, and the maximum experiment name length adjusted for dgxc clusters.

Changes

  • Evaluation Pipeline — examples/evaluation/deploy.sh, examples/evaluation/launch_evaluation_pipeline.py: Removed the tensor_model_parallel_size, pipeline_model_parallel_size, and context_parallel_size arguments from the deployment configuration. The command in the launch script was reformatted to a multi-line f-string with preserved functionality.
  • Performance Setup — scripts/performance/setup_experiment.py: Reduced the maximum experiment name length from 37 to 33 characters for dgxc cluster configurations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

r0.3.0

Suggested reviewers

  • erhoo82
  • malay-nagda
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive. The title references PR #2189 and indicates a cherry-pick operation, but does not clearly describe the actual change: removing parallelism arguments from a deployment script. Resolution: consider simplifying the title to focus on the primary change, such as "Remove parallelism arguments from deployment script" or "Fix deployment configuration for r0.3.0".

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes — ✅ Passed. The PR contains only minor changes: code formatting, a constraint adjustment on experiment name length, and a deployment configuration simplification. These are maintenance-level modifications that do not constitute major changes requiring comprehensive test result documentation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cherry-pick-2189-r0.3.0





@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/evaluation/launch_evaluation_pipeline.py (1)

14-14: ⚠️ Potential issue | 🟡 Minor

Shebang line should be on line 1, before the copyright header.

The #!/usr/bin/env python3 shebang on line 14 is placed after the copyright comment block. For the OS to recognize the interpreter directive, it must be the very first line of the file. This is a pre-existing issue but worth fixing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/launch_evaluation_pipeline.py` at line 14, Move the
interpreter directive to the very top of the script so the OS recognizes it:
place the line "#!/usr/bin/env python3" as line 1 of
launch_evaluation_pipeline.py (before any copyright or comment block), removing
the existing duplicate later in the file; ensure there are no blank lines or BOM
before the shebang.
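To illustrate why placement matters, here is a minimal, self-contained check; the file contents below are hypothetical examples, not code from this repository. The kernel only honors an interpreter directive when `#!` opens the very first line of the file.

```python
def starts_with_shebang(script_text: str) -> bool:
    """Return True if the interpreter directive is the very first line."""
    first_line = script_text.splitlines()[0] if script_text else ""
    return first_line.startswith("#!")

# Shebang first: the OS recognizes the interpreter directive.
good = "#!/usr/bin/env python3\n# Copyright header...\nprint('ok')\n"

# Shebang after the header: it is treated as an ordinary comment.
bad = "# Copyright header...\n#!/usr/bin/env python3\nprint('ok')\n"
```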
🧹 Nitpick comments (3)
examples/evaluation/launch_evaluation_pipeline.py (1)

114-123: Reformatted command looks correct; minor: missing space before | on line 118.

The multi-line reformatting is a readability improvement. However, there's a missing space before the pipe on line 118 ({args.num_gpus}| tee), unlike line 122 which has {args.parallelism} | tee. While bash handles this fine, consistency is nice.

Consistency fix
-            {args.num_gpus}| tee -a deploy.log & \
+            {args.num_gpus} | tee -a deploy.log & \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/launch_evaluation_pipeline.py` around lines 114 - 123,
The f-string command block used for launching the evaluation pipeline has a
missing space before the pipe after {args.num_gpus} causing an inconsistency
(see the multiline f-string assigned to command in
launch_evaluation_pipeline.py); edit that string to insert a space so it reads
"{args.num_gpus} | tee -a deploy.log & \\" to match the later
"{args.parallelism} | tee -a eval.log" and keep formatting consistent.
scripts/performance/setup_experiment.py (1)

387-389: Consider extracting the magic number 33 into a named constant.

The truncation limit changed from 37 to 33, but the rationale for the specific value isn't documented. A named constant (e.g., DGXC_MAX_EXP_NAME_LENGTH = 33) near the top of the file would make the constraint self-documenting and easier to find if it needs to change again.

Proposed refactor

Near the top of the file (e.g., after line 60):

DGXC_MAX_EXP_NAME_LENGTH = 33

Then at line 388:

-        exp_name[:33] if dgxc_cluster is not None else exp_name
+        exp_name[:DGXC_MAX_EXP_NAME_LENGTH] if dgxc_cluster is not None else exp_name
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/setup_experiment.py` around lines 387 - 389, Replace the
magic number 33 with a named constant: define DGXC_MAX_EXP_NAME_LENGTH = 33 near
the top of the file (e.g., after imports or around line 60) and update the
truncation site that uses exp_name and dgxc_cluster to use
DGXC_MAX_EXP_NAME_LENGTH (exp_name = exp_name[:DGXC_MAX_EXP_NAME_LENGTH] if
dgxc_cluster is not None else exp_name); add a brief comment next to the
constant explaining the Kubernetes name-length constraint so future maintainers
can find and change it easily.
examples/evaluation/deploy.sh (1)

1-16: Missing shebang and NVIDIA copyright header.

The script lacks both a shebang line (#!/bin/bash) and the NVIDIA copyright header. While these are pre-existing omissions (not introduced by this PR), the coding guidelines require an NVIDIA copyright header on all shell scripts, and best practice calls for a shebang. Consider adding them in a follow-up.

As per coding guidelines: "Add NVIDIA copyright header to all Python files and shell scripts at the top of the file."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/evaluation/deploy.sh` around lines 1 - 16, Add a bash shebang and
the required NVIDIA copyright header at the top of this shell script (deploy.sh)
so it complies with coding guidelines; place the shebang (#!/bin/bash) as the
very first line and immediately below it insert the standard NVIDIA
copyright/header block used across the repo, then preserve the existing logic
that references MEGATRON_CHECKPOINT, NUM_REPLICAS, NUM_GPUS and the python
invocation without other changes.


ko3n1g commented Feb 26, 2026

will be merged via #2509

@ko3n1g ko3n1g closed this Feb 26, 2026
